Partitioned ingest API#4317
Conversation
- CommitIngestPartitions: detect open outer transaction via PQtransactionStatus and scope the commit with SAVEPOINT / RELEASE instead of BEGIN / COMMIT, so calling Commit does not silently close or roll back the caller's outer transaction. Reject error/unknown transaction states with INVALID_STATE. - Scope the abort test's leftover-staging query to the current ingest handle's prefix instead of matching all adbc_stg_* tables in the schema, so the test isn't flaky against prior runs or parallel tests. - Add a static_assert that the generated staging table name fits under PostgreSQL's default NAMEDATALEN, so any future widening of the id or prefix fails at compile time instead of silently truncating and causing name collisions during abort. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…ridging - Guard the four new AdbcConnection*IngestPartitions wrappers against NULL connection (and Begin/Write out-params) before touching private_driver, so bad callers get ADBC_STATUS_INVALID_ARGUMENT instead of a null dereference. - Make kSupportedVersions the single source of truth for the AdbcLoadDriverFromInitFunc version check (std::find) instead of a switch that has to be kept in sync manually. - Add a gtest that loads a pre-1.2.0 driver through the manager at ADBC_VERSION_1_2_0 and asserts each new ingest entry point routes to the NOT_IMPLEMENTED default stub (plus the struct-size sanity check now tracks ADBC_DRIVER_1_2_0_SIZE). - Docstring for AdbcConnectionWriteIngestPartition now notes that a failed write may leave partial server-side state and the caller must still Abort the handle to release staging resources. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Scope the CommitIngestPartitions savepoint to a per-call name derived from the handle id so it cannot alias a caller-managed savepoint, and always RELEASE the savepoint after ROLLBACK TO SAVEPOINT so the caller's savepoint stack is restored to its pre-call shape on failure. Add tests covering the SAVEPOINT branch (visibility inside the outer transaction and rollback semantics) and the aborted-transaction rejection path. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
- Reject NULL handle/data in AdbcConnectionWriteIngestPartition and NULL handle plus NULL receipts/receipt_lens (when num_receipts > 0) in Commit/Abort, returning ADBC_STATUS_INVALID_ARGUMENT before any driver dispatch. - Add gtests that exercise the public AdbcConnection*IngestPartitions entry points: one case proves a NULL connection short-circuits to INVALID_ARGUMENT, another wires a real connection to a 1.2.0-loaded driver and confirms each wrapper dispatches to the FILL_DEFAULT NOT_IMPLEMENTED stub. - Add a test that AdbcLoadDriverFromInitFunc rejects an unrecognized version constant with ADBC_STATUS_NOT_IMPLEMENTED, guarding the std::find / kSupportedVersions check from silent regressions. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…savepoint abort path Add a compile-time guard for the "adbc_ingest_commit_" savepoint name length against Postgres NAMEDATALEN-1, mirroring the existing guard on staging table identifiers so a future rename of the prefix cannot silently produce truncated, aliasing savepoint names. Annotate the abort_ingest lambda to document that cleanup errors are intentionally discarded in favor of preserving the first-cause message already stored in the caller's error. Harden the CommitInsideOuterTransactionUsesSavepoint test by explicitly forcing libpq into PQTRANS_INTRANS before Commit so the savepoint branch is exercised even if the driver ever defers BEGIN. Add a new test (CommitFailureInOuterTxnReleasesSavepoint) that triggers an INSERT failure mid-commit while inside an outer transaction, then verifies the outer transaction remains usable and that the driver's savepoint has been RELEASEd (not leaked onto the caller's stack) — covering the failure path of the ROLLBACK TO / RELEASE sequence that the savepoint isolation motivated. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…onnection The existing dispatch test only exercised the new handle/data/receipts NULL guards via a NULL connection, so a future reorder that hides those checks behind the connection check would not regress the suite. Add a per-wrapper case that supplies a valid connection but NULL handle/data/out_handle/ out_receipt (and num_receipts > 0 with NULL receipts/receipt_lens for Commit/Abort) to pin each guard independently. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Make IngestHandle::kCommitSavepointPrefix and an inline HexId16 helper the single source of truth for the commit-savepoint name, defined header-inline so tests linking against the shared driver library can reuse them. The NAMEDATALEN-1 static_assert now guards the same constant the construction site uses, so a future rename can no longer drift past the assert. Tighten the savepoint-abort regression test accordingly: it now builds the probed savepoint name from kCommitSavepointPrefix + HexId16 instead of duplicating the literal and hex encoding, bounds-checks the receipt-decoded schema/table lengths before advancing the read pointer, and explicitly recovers the outer transaction (asserting the rollback succeeds) after the ROLLBACK TO probe leaves libpq in PQTRANS_INERROR. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Replace the .cc-local HexId with the header-inline internal::HexId16 in StagingPrefix so both call sites share one implementation, and derive kHexIdLen from sizeof(IngestHandle::ingest_id) so the NAMEDATALEN assert cannot drift if the id width ever changes. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
|
Do you want to target this against the spec-1.2.0 branch instead? |
| struct AdbcConnection* connection, const char* target_catalog, | ||
| const char* target_db_schema, const char* target_table, const char* mode, |
There was a problem hiding this comment.
I'd rather we use the existing options here instead of duplicating them into the signature
| /// write; that happens at Commit (commit it) or Abort (drop it). | ||
| /// | ||
| /// \since ADBC API revision 1.2.0 | ||
| struct AdbcIngestReceipt { |
There was a problem hiding this comment.
It would be good to explain the receipt/handle explicitly here instead of leaving it implicit from the below definitions.
Additionally I think the comments could generally be cleaned up.
| /// Abort is best-effort. If cleanup is incomplete, the driver | ||
| /// returns a warning status and orphaned storage may remain; it is |
There was a problem hiding this comment.
Perhaps this would leverage ConnectionSetWarningHandler?
| /// the driver's responsibility to provide housekeeping (e.g. TTL, | ||
| /// background GC, or documented manual cleanup). Callers may also |
There was a problem hiding this comment.
This sounds more like the semantics of the backend system
CurtHagenlocher
left a comment
There was a problem hiding this comment.
Some initial thoughts.
| AdbcStatusCode (*ConnectionWriteIngestPartition)( | ||
| struct AdbcConnection*, const uint8_t*, size_t, struct ArrowArrayStream*, | ||
| struct AdbcIngestReceipt*, struct AdbcError*); | ||
| AdbcStatusCode (*ConnectionCommitIngestPartitions)( |
There was a problem hiding this comment.
Repeating my comment from the mailing list, is this use of Commit going to be misleading around its relationship with a database transactional commit?
| /// Abort is best-effort. If cleanup is incomplete, the driver | ||
| /// returns a warning status and orphaned storage may remain; it is |
There was a problem hiding this comment.
Perhaps this would leverage ConnectionSetWarningHandler?
| /// call Abort if the coordinator crashed and was restarted without | ||
| /// the original receipts. |
There was a problem hiding this comment.
Is it worth distinguishing the case of "Abort an unknown handle" from the case of "something went wrong during Abort" with a specific error code for the former?
| struct AdbcIngestHandle; | ||
| struct AdbcIngestReceipt; |
There was a problem hiding this comment.
The only role these seem to play is to have an API that allows the corresponding opaque buffer to be deallocated. We're also not getting any type safety from having two of them because both are just byte arrays post-serialization. Could we simplify and generalize a little and declare a single type for this e.g. AdbcSerializableHandle?
| /// point to an AdbcDriver. | ||
| /// | ||
| /// \since ADBC API revision 1.2.0 | ||
| #define ADBC_VERSION_1_2_0 1002000 |
There was a problem hiding this comment.
Can you please resubmit this as a PR against the spec-1.2.0 branch where this should already be defined?
demo PR for a new partitioned ingest API.